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Abstract 

A biological pathway represents a set of genes that serves a particular cellular or a 
physiological function. The genes within the same pathway are expected to function 
together and hence may interact with each other. It is also known that many genes, 
and so pathways, interact with other environmental variables. However, no formal 
procedure has yet been developed to evaluate the pathway-environment interaction. In 
this article, we propose a semiparametric method to model the pathway-environment 
interaction. The method connects a least square kernel machine and a semiparametric 
mixed effects model. We model nonparametrically the environmental effect via a nat- 
ural cubic spline. Both a pathway effect and an interaction between a pathway and 
an environmental effect are modeled nonparametrically via a kernel machine, and we 
estimate variance component representing an interaction effect under a semiparametric 
mixed effects model. We then employ a restricted likelihood ratio test and a score test 
to evaluate the main pathway effect and the pathway-environment interaction. The 
approach was applied to a genetic pathway data of Type II diabetes, and pathways with 
either a significant main pathway effect, an interaction effect or both were identified. 
Other methods previously developed determined many as having a significant main 
pathway effect only. Furthermore, among those significant pathways, we discovered 
some pathways having a significant pathway-environment interaction effect, a result 
that other methods would not be able to detect. 

Keywords: environmental variable; Gaussian random process; Kernel machine; Path- 
way analysis; Semiparametric model; Smoothing splines 

Running Title: Semiparametric Method for Evaluating Pathway-Environment Inter- 
action 



1 Introduction 



Gene-related diseases are complex processes associated not only with specific gene or gene 
sets but also with gene-gene and gene-environment interaction. For decades, statistical 
methods have focused on analyzing microarray data based on single genes or single-nucleotide 



polymorphisms (SNPs) analysis (Chatterjee et al. , 2006 Hahn et al. , 2003 Maity et al. , 2009 



Moore et al. 2010 Ritchie et al. 2001). However, single-gene based methods have many 



limitations. For instance, the effect of one gene on a disease is difficult to interpret and 
current methods are unable to model gene dependencies so that they may not detect genes 
with moderate changes that give more insight into biological processes but pick up single 



gene with dramatic changes (Mootha et al. , 2003). For these reasons, gene-set or pathway- 



based approaches have attracted increasing attention in recent years ( Goeman et al. , 2004 



2005 


Liu et al. 


2007 


Wang, et al. 


2007; 


Pang et al. 


2006, 


2011 


Kim et al 


2011) 



recognized that a joint study of the association between the outcome and a group of genes 
within the same pathway could complement genes/SNPs analysis for providing insight in 



understanding complex diseases (Wang, et al. 2007). 

A genetic pathway is the interactions of genes that depend on each other's individual 
functions and act accordingly to create the aggregate function related to a cellular process 



(Goeman et al. 2004). There are several special characteristics of pathways, such as various 
dimensionality (a pathway can contain several genes or over a thousand ones), and inter- 
action network (genes within the a pathway are expected to function together and hence 
interact with each other). Thus traditional statistical analyses face difficulties in handling 
these situations. For instance, linear parametric models usually either fail due to the "curse 
of dimensionality", or end up with computational explosion in the number of possible in- 
teractions among genes within a pathway. To deal with these difficulties, many innovative 



statistical methods have merged in recent years. Goeman et al. (2004) proposed a global 



test derived from a random effects model to determine the significance of the global ex- 



pression pattern of a group of genes. A random forests approach was proposed by Pang 



et al. (2006). Liu et al. (2007) proposed a semiparametric model for covariate and genetic 



pathway effects on continuous outcomes, where the covariate effects and the pathway effect 
are modeled parametrically and nonparametrically, respectively. They established the con- 
nection between the least squares kernel machine (LSKM) and linear mixed models, which 



simplifies specification of a nonparametric model with multi-dimensional data. Pang et al. 



(2011) considered more complicated situations with two or more pathway effects presented 
in the linear mixed model, which allows the researcher to study how multiple pathways relate 
to the phenotype of interest. A semiparametric Bayesian approach has also been proposed 



for evaluating pathway effects on clinical outcomes Kim et al (2011). However, despite the 
success of analyzing pathways instead of a single gene, all existing methods ignore the envi- 
ronment exposure covariates, and still fewer focus on the interaction between environmental 
variables and the genetic pathways. 

It has been recognized that genetic factors alone cannot account for many cases of gene 



related disease ( Adami, et al.| 2008 Chakravarti and Little 2003). The gene-environment 
(G-E) or pathway-environment (P-E) interactions are critical in understanding the dynamic 
process of disease since ignoring them may mask the detection of a genetic effect and may 



lead to inconsistent association results (Manolio et al. , 2006). Furthermore, understand- 
ing the G-E interactions can be important for risk prediction and evaluating the benefit 
of changes in modifiable environmental exposures or environmental regulations. For these 
reasons, the number of studies utilizing gene-environment interactions has increased dra- 
matically. These range from semiparametric linear or logistic regression models with linear 



combinations of genes/SNPs as the predictor (Chatterjee et al. 2006 Maity et al. 2009 



Park and Hastie , 2008 ) to the multifactor dimensionality reduction (MDR) as a data mining 
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technique for identifying genetic and environmental effects associated with either dichoto- 



mous or continuous phenotypes (Ritchie et al. 2001 Hahn et al. , 2003 Moore et al. 2010). 
Unfortunately, these studies are all genes/SNPs based methods, and they possess problems 
in dealing with the pathway analysis. For example, representing the pathway effects with 
linear combinations of genes has limitations in detecting non-linear patterns of interacting 
genes. Furthermore, the number of genes in a pathway can be in the hundreds or thousands, 
which makes modeling the gene-gene or gene-environment interaction very consuming. 
To capture high order interactions within the high dimensional genes regressor space as 



well as the G-E interactions, Zou et al. (2010) employed a nonparametric regression model 
with a Gaussian process. With their model the gene and environmental variables are modeled 
non-parametrically, and all of the possible interactions effects are considered simultaneously. 
However, using one Gaussian process to describe both gene and environmental variable 
function spaces results in all the interaction effect being indistinguishable. Thus it is almost 
impossible to apply a suitable test for interesting effects such as G-E interaction. 

In this paper, we propose a semiparametric mixed effects model to include environmen- 



tal variables, genetic pathway effect, and their interaction. By extending Liu et al. (2007)'s 
linear mixed model to our model, we evaluate the interaction between an environmental vari- 
able and pathway as well as allow nonlinear relationships between the environmental variable 
and a continuous outcome. Assuming that both the pathway and interaction effects have 
multivariate normal distributions with a zero mean and covariance structure with specific 
kernels, we model them within the framework of Gaussian processes. Thus in our model both 
pathway and interaction effects are indeed modeled as random effects. Instead of modeling 
the environmental variable as a parametric fixed effect, we model it non-parametrically via 
natural cubic spline. By modeling environmental variables and pathways in this way, we can 
construct the kernel for the P-E interaction based on the analysis-of- variance-like (ANOVA- 
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like) decompositions of functions (Wahba 1990 Gu and Wahba 1993) for a multivariate 
function. The feature of our method is to model the interaction between environmental and 
pathway covariates separately from the interactions among genes within the pathway, which 
are automatically modeled by the Gaussian process for pathway effect. Our model also ex- 
tends the additive and interaction smoothing splines for univariate functions to multivariate 
functions with arbitrary kernel. 

In a mixed model, the smoothing parameters of the spline and the Gaussian kernels can 
be considered as the variance components of the random effects, and thus are simultane- 
ously estimated by maximizing the restricted maximum likelihood (REML). By additively 
modeling the multivariate functions, this model is suitable for analyzing genetic pathway 
data in which the P-E interaction attracts particular interests. Furthermore, the covariance 
structure of our model makes the test of the "overall" pathway effect or P-E interaction 
effect possible. By "overall" we mean either the main effect of a pathway, the interaction 
effect associated with the pathway, or both. The restricted likelihood ratio test (RLRT) of 
two zero variance components under non-standard conditions is employed to test the overall 
pathway effect, while the RLRT of one zero- variance component and score test are applied 
to test the P-E interaction. 

We first define our model in Section |2j and discuss two REML methods to estimate 
the model parameters in Section |3j Then in Section [IJ we introduce PLRT statistics for 
testing two or one zero- variance components and the score test for testing the P-E interaction 
effect. In Section [5j we present a set of simulation studies concerning nonparametric function 
estimates and variance component tests for various settings. In Section [6] we apply our 
method to the genetic pathway data for Type II diabetes. Finally, in the last Section, we 
conclude our work and discuss potential extensions of our model. 
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2 Construction of Semiparametric Linear Mixed Ef- 
fects Models 

2.1 Model Description and the Kernel of the Interaction Function 
Space 

Let us consider that we have a total of n subjects and the ith subject has a continuous 
disease-related outcome y^i = 1,2, ...n. We are interested in relating this response y = 
(yi, i/2, ...,y n ) T with one particular pathway gene expression data Z = (zi, z 2 , z n ) T and k 
environmental variables. In a general form, we can write this nonlinear relationship as 

y = f + e, (1) 

where e and f are nxl dimensional vectors with a specific relationship with y for the ith 
entry as = f{xj, zj) + ej, in which xj = (xn,Xi2, Xn~) is 1 x k vector of environmental 
variables and zj = (zii,z i2 , ...,z ip ) is the 1 x p vector of gene expression within a pathway 
and p is the gene number. In this paper, we only consider the case with one environmental 
variable, i.e., k = 1 so that the input x T is reduced to univariate x. We assume that the 
errors e ~ iV(0, a 2 I) are n x 1 iid random variables vector. /(•) denotes the unknown non- 
linear smooth functions for Xj, zf , and their interaction. In this paper, we assume function 
/ has the following form: 

f{x, z T ) = (3 + f x (x) + f z (z T ) + f xz {x, z T ), (2) 

where (3 is the intercept term, and f ai ot G {x,z,xz}, represents the nonlinear effect of the 
environmental variable, the pathway or the interaction respectively. The above equation is 
similar to the additive model with two univariate variables and their interaction, except z T 
is a multivariate variable. By writing the general model ([T]) in this way, we can estimate 
f x , f z and their interaction f xz separately according to the characteristics of the pathway 
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and the environmental variable. We model f x {x) using the nonparametric function such as 



a cubic smoothing spline (Wahba 1990 Lin and Zhang 1999 Zhang and Lin 2003). To 
handle the high dimensional pathway covariates, z T , we may consider a Gaussian process to 
express f z {z T ) since the least squares kernel machine method with the Gaussian kernel has 



achieved success in a genetic pathway data analysis (Liu et al. 2007). 

Before we derive the specific representation for the interaction function, we need examine 
the function space of f x and f z respectively. For the smoothing spline x e T = [0, 1], f x is 
spanned on the function space 7i x = l-i x where 7i x and 7i x represent the direct 

sum operator of two subspaces, the null function space and the penalized function space 



respectively (Wahba, 1990). Assuming n distinct values of X{ such that < x\ < ■ ■ ■ < x° n < 



1, the mth order smoothing spline estimator f x (x) can be expressed as (Wahba, 1990 Zhang 



and Lin 2003), 



fx(x) = ^bj^jix) + ^2c i k x (x,x° i ), 

j=l i=l 

where <pj(x) is the polynomial basis that span the null space with <pj(x) = x^~ x j{j — 
l)!,j = l,2,...,m, and k x (x,xf) = [(m - l)!] -2 J(x - w)+ -1 (^ 9 - u)^T l du is the kernel 
which uniquely determines the space For m = 2, the natural cubic spline that we shall 



apply in our model, the kernel of H x can be calculated as (Hastie et al. , 2009 Rasmussen 



and Williams, 2006) 



k x \Xj x ) 



(x — u)+(x' — u)+du 



mm [x, x' 



A3 



+ 



min(x, x') 2 \x — x 1 



(3) 



where subscript "+" indicates the positive part of the expression. For the null space %° x) the 
kernel is calculated as k x (x, x') = Y^j=i — 1 + xx ' ■ 

With the orthonormal polynomial basis, ~H X = {1} where {1} and {x} stand for 

the linear function spaces spanned by the constant 1 and the linear basis x which is centered 



(Guo, 2002). Since the kernel of the function space of the direct sum of two subspaces is 
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expressed by the direct sum of the kernel of the subspaces (Aronszajn, 1950 Wahba, 1990), 



we can derive the kernel of the function space without the the constant term for the cubic 
smoothing spline, {x} ©"H^, as \xx' + k x (x,x')]. 



For the function space of f z , we consider a similar argument by MacKay (1998) that 



starting from a parametric model, we can span the function of f z by a radial basis 



H 



(4) 



h=l 



where (ph(z T ) = exp 



\z-z h \ 
2p 



is the radial basis functions centered at fixed points {z h }j^ =l . 



Assuming c = (ci, qJ t ~ N(Q, t z I), the entry of the covariance matrix of f z is expressed 



as 



R = r z J2' 



) h \z)<p h {z ). 



Taking as an example a one-dimensional case, MacKay (1998) shows that in the above 



expression the sum over h becomes an integral when taking the limit H — > oo such that 
R = t z exp [—(z — z') 2 / p]. Generalizing from this particular case, we can define the Gaussian 
kernel of the function space Hi on z 



kAz\z' T 



exp (- 



\z — z 



/ 1 1 2 



IP) 



(5) 



and we assume that f z is generated from a zero mean Gaussian process with the kernel 
matrix produced by k z . 

Since the tensor product of the kernels of two function spaces determines a new function 



space (Aronszajn, 1950), we use the tensor product of the kernels of {x}@%\. and "H* to 
construct a new function space, T-L], z , which contains any order interaction f xz between x and 
z T . Now we can express the kernel of the interaction function space as 



kxz [x, z j x , z "\ — \xx + k x (x, x )] • k z (^z , z J . 



(6) 
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Therefore, we are able to represent the nonparametric interaction function using a zero mean 
Gaussian process with the kernel matrix produced by this kernel function. 

In the rest of this paper, we use K x , K z and K xz to stand for the Gram or kernel matrices 
produced by k x , k z and k xz respectively. In a specific problem, the environmental variable x 
must be scaled into T = [0, 1] to construct the interaction kernel. Notice the model expression 
(|2j) is not the analysis of variance (ANOVA) decomposition of the smoothing function / since 
1-L\ and H xz are not orthogonal to each other. This may cause the identifiability problem 
between f z and f xz . However, in practice, this problem only happens to our model in extreme 
situations such as when the entries of matrix xx'+k x (x, x') are close to each other. In general, 
f z and f xz can be identified well as shown in the simulation and application study. 

2.2 Linear Mixed model Representation 

Now we are prepared to pose the optimization problem. Based on the above argument, 
the corresponding function spaces that are penalized are and H]. z . Analogous to 

the additive models (Hastie and Tibshirani, 1990), the estimation problem for model ([!]) 



becomes: for a given set of predictors (xj, zj),i = 1, 2, n, find / to maximize 

-^(y-f) T (y-f)-^E A «H^II^' ( ? ) 

a 

where H/allw 1 's are the norms induced by K a of H\, a G {x, z, xz}, and A a 's are the penalty 
parameters that balance the tradeoff between goodness-of-fit and smoothing of the curve or 
high dimensional surface. The solutions to expression Q are called the least square kernel 



machine estimation, and Liu et al. (2007) showed the equivalence of the least square kernel 
machine to the linear mixed model without interaction effects. 

The model @ can be represented in terms of a mixed model as follows. According to 



the Representer Theorem (Kimeldorf and Wahba, 1971), the nonparametric function can be 



expressed by the kernel, /,(•) = zf ) and f xz {-) = J2i=i b i k xz(-'i x h z f)- So the 
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vectors of these functions are 



K z a, 
K xz b, 



where a>i G K, bi G K. Based on the properties of reproducing kernels, the squared norms of 
Hi and 1-L XZ can be expressed as 
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||/, z ||^ = b T K xz b = flK~ z %,. 
To represent the remaining part of model (j2|, [3 + f x (-), we follow Lin and Zhang (1999); 
Zhang et al. (1998); Green (1987); Green and Silverman ( 1994[ )'s procedure. The vector of f x , 
f x (note here the constant /3q is absorbed into f x ), can be expressed in terms of f3 = (/Jo, fii) T 
and in — 2) x 1 random vector r r as 



f a = XP + Br x (8) 

for n distinct input x values, where ~ N(0,r x I) and X is the design matrix of the null 
space T-L° x spanned by the orthogonal polynomial basis, i.e., X = (l,x) and x is the n x 1 
vector of centered x. B is a matrix defined as B = L(L T L) _1 , where L is n x (n — 2) full 



rank matrix with M = LL T . M is a penalty matrix defined by Green and Silverman (1994) 
such that the squared norm of 



II/: 



i=J [m)fdt=^Mi x 



More details to define B and M can be found in Green and Silverman (1994), Zhang et al. 



(1998) and Appendix A 



Plugging those representations of square norms and f a 's back into Q, we have 

-\{y - ff(y {Kr T x r x + x z fjK z X + x x jIk- z X z ) . 
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If we define X x = <j 2 /t x , X z = <j 2 /t z and X xz = <j 2 /t xz , and have random vectors r z = f z , r 2 
N(0, t z K z ) and 

^ xz — ^xzi^xz ~ -^(0, t xz K X z) ) then the above equation is equivalent to 



I 



I 



—Ay - f) T (y - f) - — r T x *x - —v t z k- x v z 

2a 2 W ' W ' 2tt x 2r z z z 2r, 



-r T K~ l r 

L xz ly xz L XZ) 



(9) 



which is the triple penalized log likelihood function of the linear mixed model 

y = f + e = X/3 + Br x + r z + r xz + e. (10) 

From the Bayesian point-of-view, f is interpreted as the sum of four zero-mean stationary 
Gaussian processes, each with a prior covariance function r a K a (/3 can be viewed with 
infinite variance). The vectors r z and r xz have more specific meanings as the pathway main 
effect and the P-E interaction effect. Although does not have such a meaning, it can 
be interpreted as the nonlinear contribution of the relationship of the response and the 
environmental variable. 

Differentiating expression ( JToj ) with respect to (3 and r^'s, it is easy to show that the 
best linear unbiased prediction (BLUP) estimate of the random effects, given a 2 and r a 's as 
fixed, is obtained from solving 



'IV. 



X T X 


X T B 


X T 


X T 




p 




X T y 


B T X 


B T B + XJ 


B T 


B T 


X 


r x 




B T y 


X 


B H 


-UK,}- 1 


I 


r z 




y 


X 


B 


I 


I + ^xz[K xz ] 




*xz 




y 



Equation (11 ) shows that the BLUP estimate of P and r a 's are unique if X X is full rank 



which is usually satisfied. 



2.3 Estimate Pathway and Interaction Effects 

Given the fixed parameters a 2 and r a 's, the covariance of y is obtained as follows usinj 



model (10) 



E = Cov(y) = a 2 l + r x BB T + r z K z + r xz K xz . 

12 
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Instead of solving expression (11) directly, we perform recursive steps to simultaneously 



achieve the approximate expressions of (3 and r Q 's, a G {x,z,xz}, 



P = (X T ^- 1 X)~ 1 X T ^- 1 y, 

r, = (B T A^B + r- 1 /)" 1 B^AT'iy ~ Xp), 

r z = (Aj 1 + t^K- 1 )' 1 A^(y -XP- Br z ), 

r xz = (A3 1 + t-}K-})- X A z \y - X? - Br x - r z ), 

where I is the (n — 2) x (n — 2) identity matrix, and Aj, j = 1, 2, 3, are covariances for the 
following distributions, 

y = X/3 + e , e ~iV(0,A = S), 

y-Xj3 = Br x + e 1 , 61 ~ iV(0,Ai = a 2 / + + r^^ 2 ), 

y-XP-Bv x = r z + e 2 , e 2 ~ N(0, A 2 = a 2 I + t xz K xz ), 

y-XP-Bv x -v z = v xz + e, e ~ iV(0, A 3 = a 2 /). 

The above expressions for /3 and r a 's are all linear transformations of y; thus, their covari- 
ances are easily determined using identity Cov(j4y) = ACov(y)v4 T = AT,A T , where A is the 



transformation matrix in expressions (13). 



3 REML Estimation of the Variance Components 
3.1 REML Approach for Estimating Variance Components 



In the previous Section, when solving the equation (11) we assume that the regularization 
parameters, r x , t z and t xz , the scale parameter p for Gaussian processes, and the error 
variance a 2 are already known. In this linear mixed model framework, we can estimate 
the parameter 6 = (a 2 , t x , t z , t xz , p) T simultaneously using restricted maximum likelihood 
(REML) estimation. REML is superior to the maximum likelihood (ML) method in terms of 



adjusting the small sample bias (Zhang and Lin, 2003). The REML of our model is derived 
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routinely (Harville, 1977) up to the usual additive constant 



log |E| - -|X T S- 1 X| - -(y - X^) T E- X (y - Xf3) + c, 



(15) 



where c is constant. Another advantage of using REML is that it accounts for the degrees-of- 



freedom adjustment of replacing (3 with (3 in expression (15) (Breslow and Clayton, 1993). 



Taking the derivatives of (15) with respect to 6, the estimates of 8 are obtained by solving 
dl R 



da 2 

m 

dr ( 
din 
dp 



^Tr(P) + i(y - X^) T E- 1 S- 1 (y - X&) = 0, 



t = ~l Tl (H P ) + \ {y ~ X ^ )TS_1 || S_1(y " X ® = °' aE ( 16 ) 



1... fdE 
2" [dp 



Tr ( F P ) + o (y - XfrfYT^Yr^y - XP) = 0, 



dp 



E- 1 X{X T E- 1 X)- 1 X T E-\ and g 



where P = £ 1 
mation matrix Z(0) has the z, jth entry as 

1 



r z ^ + r xz ^. The 5 x 5 infor- 



(17) 



and the variance of can be estimated through the expression of the information matrix. 



Equation (16) can be solved using an iteration method such as Fisher's scoring method. In 



practice, the sample size n may be small, for instance the Type II diabetes data contains 



only 35 observations, while the model (10) includes two fixed-effect parameters and three 



smoothing parameters. We may have problems with overparameterization, and it may cause 
a negative estimate of the variance components based on REML. In such case, the step- 



halving method can be adopted (Jennrich and Schluchter, 1986), but still the corresponding 



variance component can be estimated as very close to zero. 

3.2 Profile REML Approach for Estimating Variance Components 

In this Section, we suggest a modification to the REML estimation of the variance com- 
ponents so that the estimate of the error components always remains in the parameter 
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space. This new approach makes the use of the profile restricted maximum likelihood (p- 



REML). The covariance of y in expression (12) can be written as S = cr 2 T,\, where £a = 
(I + \~ l BB T + \- l K z + \~lK xz ). Defining the matrix P x = T,^ 1 — T,'^ 1 X(X T 'E^ 1 X)~ 1 X'E^ 1 , 



and P = P\/<J 2 , the restricted log likelihood function (15) can be rewritten as 



I 



R 



l -{n-q) \og{a 2 ) - I|E A | - I ]o K \X T ^X\ - I*** 



+ c, 



2 ^ 2 a 2 

where q = 2 is the rank of X. Assuming that X a ,a G {x,z,xz} are known, by solving the 



derivative of (18) with respect to a set equal to zero, the p-REML estimator of a is 



a 



n — q 



(19) 



Since P\Xa is idempotent, y ~ Xt(p x )j w h ere r (Px) = Tr(P\) is the rank of P\, the 



variance of Var(a 2 ) ~ 2er 4 Tr(PA)/(n — <?) 2 - Plug a 2 back into expression (18) and we have 
the log profile restricted likelihood (PRL) function 



PR 



-^ lo g| s A| 



2 1 A 1 2 



log(y T P A y) + c. 



(20) 



Now we can use the similar scoring algorithm to estimate 6* = (X^, X z 1 , X x }, p). By simple 
algebra the score of the p-REML likelihood is 
dlpR 



06* 



X Tr P Ea P 
_ 2 1 ~O0* 



1 jr-dEx 



2a 



2 y 2 Px^Pxy,J = 1,2,3,4, 



(21) 



and the i,jth entry of the information matrix X*{0*) for the PRL can be approximated as 
1 



1*(0* 



2(n - q) 



09* A 06* 



06* 



09* 



• (22) 



Note that I* (6*) is positive definite when n is large enough. Claeskens (2004) also showed the 
convergence of X*{6*) under regular conditions so that we can apply the restricted likelihood 
ratio test (RLRT, see Section [5]). Since PRL is not a true likelihood, we only use PRL 
for statistical test purposes, and use p-REML to obtain a better estimate of the variance 



components. The variances of is found by plugging the p-REML estimates into (17). 
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4 Test for Pathway Effects 

4.1 Test for Two Zero Variance Components 

One of the primary problems in the study of pathway based analysis is testing the "overall" 
pathway effects. Recall that the meaning of "overall" refers to either the main effect of a 



pathway, the interaction effect associated with the pathway, or both. In model (10), two 



random effects are involved with the overall pathway effects. Thus, the hypothesis for testing 
the overall pathway effect is 



H :t z = t xz = vs. H a : r z > or r xz > 0, 



(23) 



which is equivalent to the following test 



Ho : A; 1 = A" 1 = vs. H n : A; 1 > or A" 1 > 0. 



(24) 



For this type of test problem, a likelihood ratio test (LRT) is most commonly used. Note 
that parameter space for 6 = (A" 1 , A^ 1 , \~ z , p) T equals [0,oo) 3 x (0, oo) (to avoid abuse 
of notation, in this Section, 6 and X stand for counterparts of PRL). The true parameters 
do are either in the interior or on the boundary of the parameter space, so the LRT is 



nonstandard. Vu and Zhou (1997) generalized the hypothesis test for both interior and 



boundary problems within a setting of mixed regression fitting, so it allows the nonidentically 
distributed response variable y^s to depend on the covariates and allows the random effects 



to induce dependence between the response values. (Claeskens, 2004) further extended the 



non-standard LRT test to the profile restricted likelihood ratio test (RLRT), focusing on 
nonparametric mixed models with spline fitting. 



Following (Claeskens, 2004), we apply RLRT to test hypothesis (24). Under this hypoth- 



esis, the RLRT test statistics, D, is the deviance of two times the log PRL, —2lp R (6), i.e. 
D = 2l PR {6) — 2l PR (0 ). Note that D is the same using either l R or l PR . Assuming that the 
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corresponding regular conditions in Vu and Zhou (1997) are satisfied for the PRL function 
model, D converges to 

inf \\U - Of- inf \\U - 6\\ 2 , (25) 
0ec o oec 

where C = {6 : = Z(0q) t ^ 2 (6 — Oo),0 € Cq} is the orthonormal transformation of the 
cone approximation, Cq, of the parameter space Q with 0q as the vertex, and Co = {6 : 
6 = I(6 ) T / 2 (6 — ),6 E Cq } is the orthonormal transformed cone approximation of the 
parameter space Qq under the null hypothesis. U is a random vector from N(0,I), and 
Z(0 O ) T / 2 is the right Cholesky square root of p-REML information matrix, i.e. X(# ) = 
[Z(0 o )] 1/2 [Z(0 o )] T/2 . 

Note that under the null hypothesis, 6 = (A" 1 , 0, 0, p) T , p is inestimable. We suggest 
estimating the parameters with p fixed at the average of ||z — z'|| 2 (average on pairwise 
observations) to not only reduce the parameter space dimensions but also achieve a better 
fit. Let 6 = (A" 1 , A" 1 , A~ Z X ) T = (6*i, 9 2 , 3 ) T . Now the cone parameter spaces are reduced to 
Cn = [0, oo) 3 and Cq = [0, oo) x {0} x {0}. However, in this problem, all three parameters 
can be on the boundaries and the orthonormal transformation for the nuisance parameter 
Q\ is not invariant, which leads to a transformation for 3 dimensional space. The calculation 



of (25) in a 3 dimensional space becomes considerably more difficult when the information 
matrix is not diagonal. To simplify the calculation, we consider the special case that Q\ ~ 0, 
which is a reasonable consideration for the Type II diabetes data in a later Section, where 
the p-REML estimates of #i's are very close to zero for most pathways. 

Now the parameter space is reduced to 2 dimensions. Under the orthonormal transforma- 
tion, the cone spaces become to C = {8 : 7^3 — 9 2 > 0, #3 > 0}, and C = {0 : 9 3 = 9 2 = 0} , 
where 7 = X 23 ■ |X(# )|~ 1//2 is the slope of the axis #3 after transformation as shown in Figure 
[11 a). To account for the fact that 9\ is estimated, Z(# ) is defined from the 3x3 information 
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matrix T(Qq) as 



Wo) 



%22 


%23 




%22 


2^23 




x 21 


X32 


X33 




x 32 


x 33 _ 




_x 31 _ 



^"11 [Xl2,Xi 3 ] 



From the graphic point of view, the representation of the test statistics (25) is determined 
by the minimum distance of the independent normal vector U = (U2, U 3 ) T to 6. Under the 
alternative hypothesis, the minimum distance, inf 0g( ^ ||Z7 — #|| 2 , can be understood as the 
projection of U on the cone space C when U is outside of the cone. As shown in Figure 
[T|a), the representations of inf 0g( =j \\U — 6\\ 2 are different in the four regions of the plane with 
coordinates (6*2, 6*3) 

' e 3 > 0, 7 #3 - o 2 > 0, / 

^2 2 + t4 2 -(7t/2 + t/3)7(l+7 2 ) #3 + 7#2>0, 7^3-^2<0, II 

ui e 3 < 0, e 2 > 0, /// 

t/| + t/ 3 2 # 3 + 7#2<0, 6 2 <0, IV. 



inf ||t/-0|| 2 

eec* 



(26) 

The area proportions, (0,1/4,1/4,1/2 — <p) as in the aforementioned order, of these four 
regions determine the probabilities that the vector U lies in which region, where = 
cos- 1 (7 • (1 + 7 2 r 1/2 ) = X 23 ■ (X22X33)- 1 / 2 . 

Under the null hypothesis, the parameters space is reduced to the origin of the plane, 



thus according to Vu and Zhou (1997) 



inf \\u -e\\ 2 = uj + ul 



Then the asymptotic distribution of D is the difference of the above two representations 



D < 



^2 2 + U% with probability <p, 

(jU 2 + U 3 ) 2 /(l + 7 2 ) with probability 1/4, 
C/| with probability 1/4, 







with probability 1/2 



I 

II 

III 

IV. 



(27) 
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Note that because U2 and C/3 are independent, thus (7^2 + Us)/ y/l + 7 2 ~ iV"(0, 1), and the 
final approximate asymptotic distribution of D is 



D 



+ O.5 X 2 + (O.5-0)x 2 . 



(2? 



In this paper, we suppose lim^^ I7I < 00. If lim n ^ 00 I7I — > 00, the representation of 
info^^ \\U — 6\\ 2 is in different form (Vu and Zhou, 1997) and the asymptotic distribution of 



D may be different. An additional approximation is that we obtain 7 with a finite sample 
size under the null hypothesis, so we assume that n is large enough that the finite 7 is close 
to the converged value. 



4.2 Test for the P-E Interaction Effect 

The RLRT for two variance components introduced above allows us to test the overall 
pathway effect. Furthermore, we may be attracted to testing single variance components, 
such as testing the P-E effect, given that the overall the pathway effect test is significant. 
The hypothesis of this problem is 



H : A" 1 = vs. H a : > 0, 



(29) 



which is equivalent to testing H : r xz = vs. H a : r xz > 0. The RLRT test statistics 
d = 21pr(0) — 21pr(0 ) for one variance component in semiparametric model with PRL 



was also suggested by Claeskens (2004), and an exact RLRT algorithm was proposed by 



Crainiceanu et al. (2005). Unfortunately, this exact RLRT method cannot apply to test 



(29) for model (10). In their work, there are no random effects in the model under the 



null hypothesis, thus d can be represented exactly as the form of a mixture of chi-square 



distribution. On the contrary, our model (10) under the null hypothesis of (29) contains two 



random effects r x and r z , which makes it impossible to represent d exactly. 
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The second choice is to use the method described in the previous section using an asymp- 
totic distribution. However, we need the same approximations; that is, we fix p and assume 
that the relationship between the response and the environmental variable is almost linear, 
i.e. t x Rj 0. Then similarly, the parameters cone space is reduced to 2 dimensions. One 
interesting parameter #3 = A~ z , and one nuisance parameter 62 = A" 1 , both have the true 
values on the boundary. Thus, Cu = [0, 00) x [0, 00) and Cn = [0, 00) x {0}. 
Under the approximations described above, the asymptotic representation of 2 times the log 
PRL function under the null hypothesis is 



inf \\U - 6\\ 2 = ■ I(U 2 > 0) + U 2 2 I(U 2 < 0) + f/ 3 2 , (30) 
0ec o 

where /(•) is the indicator function. The representation under the alternative hypothesis is 



the same as in (26), but because the cone under the null hypothesis is no longer the origin of 
the (92,0s) plane, inf 0g( 5 o \\U — 6\\ 2 has two regions as shown by (30). Now we must divide 



the plane with coordinates (62,03) into five regions and set the approximated asymptotic 
representation of d as (see Figure pjb)) 

U 2 with probability 1/4, I 

U^ + Ul with probability 0- 1/4, I* 

d—t (lU 2 + U 3 ) 2 /(l + ^ 2 ) with probability 1/4, // (31) 

with probability 1/4, 77/ 

with probability 1/2 — 0, IV. 

Thus, we have the asymptotic distribution of d for testing 6*3 = A^ 1 = or t xz = 

d ~ (0 - 0.25)^ + 0.5x1 + (0.75 - cf>)xl, (32) 



where <p is calculated through 7 under hypothesis (29). 

In many cases, the relationship between the response and the environmental variable is 
not linear, i.e. r x is significant and not equal to 0, then we are in the 3 dimension space to 
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derive the asymptotic distribution of the d, which becomes arduous. In this situation, we 



adopt a score test approach based on the REML function (15) which was proposed by Lin 



(1997) in a mixed model. The asymptotic distribution of the REML score may not converge 



to a standard normal distribution, Zhang and Lin (2003) suggested using the scaled chi 



square approximation of the test statistics. More generally, the REML score for covariance 



component r a ,a G {x, z,xz} of (16) can also be written as 

dr a ~ 2™ dr/ y 2 ir V dr a ' ' 
where we used identity (y — X/3) T Y I ~ 1 = (Py) T . P can be expressed as P = r(r T £r) _1 r 



l-rT 



(Searle et al. , 1992), where T T is (n — q) x n matrix with full row rank n — q (q = 2 is the 



rank of X). The matrix T T satisfies T T X = and T T y ~ iV(0,r r Sr). Thus the REML 
version score test statistics can be written as 



1 <9£ 
U Ta = -(Py) T —Py = y T My, 



(33) 



where y = (r T sr)-5r T y with y ~ n(o, /„_,), and m = ^(r T Er)-5r T ||r(r T Er)-i 

U Ta is the quadratic form of y with mean E(U Ta ) = |Tr (P§^j and variance Yax(U Ta ) = 



Xjj, where Xjj is the corresponding entry of the information matrix (17) for the interesting 
variance component of r a G {t x , t z , t xz }. 

Let r denote the number of non-zero eigenvalues of M, then M can be further decomposed 
using the spectral decomposition as M = HEH T = Y^i=i j where H = (hi,...,h r ) is 

n x r orthogonal normal matrix, i.e. hfhj = 5ij, and 5 = is r x r diagonal matrix. It 
follows that 

r r 

U Ta = y T HEH T y = J>y T /^[y ^J^^xl 

i i 

Therefore, under H , the distribution of U Ta can be represented as a weighted mixture 
of chi-square distribution. This is because y T hihfy ~ xl since hihf is an idempotent 



matrix with rank 1. Because the calculation for £j's is intensive, we follow Zhang and Lin 
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(2003) in using the Satterthwaite method to approximate the distribution of U Ta by a scaled 



chi-square distribution k\ 2 , where k = Tjj/2E(U Ta ), and v = 2E{U Ta ) 2 /Tjj. Zhang and 



Lin (2003) also suggested to further account for the fact that 6 = (u 2 ,t x ,t z ,t xz , p) T is 



estimated, so that k and v are calculated by replacing Tjj with the efficient information 
Tjj = Tjj — Tjtfl^ljfl, where Tj# and T^ are the corresponding vector and matrix if we 
rearrange the 5x5 information matrix T(0) as 



1(6) 



Tj$ Ttfd 



In this paper, we are particularly interested in testing the P-E interaction effect, i.e., r xz 



5 Simulation Study 
5.1 Parameters Estimation 

We carried out the simulation study to evaluate the accuracies of the estimators; 200 runs 
were performed for each of the simulation scenarios. Let p denote the number of genes in 
the pathway and n denote the number of observations. We considered a setup that mimics 
the real diabetes pathway data with a total of 50 genes within a pathway. The true model 
of the zth observations is 

Vi = f x {xi) + f z (zj) + f xz (xi, zj) + a, €i ~ N(0, a 2 ) 

with nonparametric functions 

fx( x i) — 5-6 + O.lxj + cos (xj7r/18) , 

/ 2 ( 2 f)=a^f 0) exp(-0.2|z|f 0) )/5, (34) 
/, 2 (x l , 2 f)=&-e- /10 sin(^ 30) )cos(zf 0) )/8, 

where zf°\ {zf^ and zf ^ stand for Y^jLi z ij > Si=i l%l/30 and Y^=i z ij/^- We sample 
Xi and (J = 1, ...,50) from Uniform[18, 36] and iV(0, 1), respectively. Furthermore, a and 
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b are parameters to control the magnitude of the nonparametric functions respectively. In 



this Section they are fixed at a = 1.5 and 6 = 2. In the true model (34), a total of 30 genes, 
zn, Zj3o, are involved. However in a real situation, we may fit the model with extra genes 
that are not involved in the true model. Thus we consider the following settings for model 



Q34J): 

Setting 1: n = 100/150, true p = 30, fitted p = 30, a 2 = 0.2 2 , 

Setting 2: n = 100/150, true p = 30, fitted p = 40, a 2 = 0.2 2 , 

Setting 3: n = 100/150, true p = 30, fitted p = 50, a 2 = 0.2 2 . 
For each setting, two sample sizes n = 100 and 150 were considered. 

In Section [3] we introduced two methods to estimate the variance components using 
REML and p-REML. We are particularly interested in comparing the performance of these 



two methods. One of the difficulties of solving equation ( 16 ) or (21 ) using a scoring method is 
finding the initial values for 9 or 0* , since there are no analytic expressions to roughly obtain 



those initial values. Breslow and Clayton (1993) suggested starting the variance parameters 
from small positive values within a complex situation. We started the variance components 
with (a 2 , t x , t z , t xz ) t = (0.001, 0.001, 0.001, 0.001) T , which is equivalent to starting with 
(cr 2 , A" 1 , A~ 1 , X XZ ) T = (0.001, 1, 1, 1) for p-REML. For scale parameter p, we can either fix or 
estimate it. In this simulation study, we choose the initial value p = 2 which is the average of 
|| z — z' || 2 on all pairwise observations if it is estimated. We also compare the results with p 
fixed at 2. Note that if p is estimated, we consider two possible ways. One way is to perform 
a two-step procedure where we first fix p at 2 and evaluate (a 2 , t x ,t z , t xz ) until convergence 
and then use the results with p = 2 as the initial values to evaluate (cr 2 , r x , r 2 , t xz , p) until 
convergence. The other way is to evaluate {<j 2 ,t x ,t z ,t xz , p) together from an initial value 
(0.001, 0.001, 0.001, 0.001, 2) T . The simulation results show that the former method is more 
stable, so only these results are shown. Similarly, a two-step procedure was used for p-REML 
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when p is estimated. 

To demonstrate the fitting results, Figure [2] shows one selected example of setting 1 
comparing estimated f,f x ,r 2 and r xz with the true ones. The overall response f is fitting 
very well as shown in Figure [2^d). As shown in Figure [2^b) and (c), there is not much 
identifiability issue since both the fitted pathway effect r z and fitted the interaction effect 
i xz capture the patterns of the true ones very well. 

To have a overall evaluation of the goodness-of-fit of the nonparametric function f x , f z 



and f xz , we followed the techniques used by Liu et al. (2007), who suggested regressing the 
true nonparametric functions on the fitted ones. By reporting the average intercepts, slopes 
and i? 2 's from these regressions, the goodness-of-fit of the fitted nonparametric functions can 
be assessed empirically. The closer to and 1 of the intercepts and slopes are and the closer 
to 1 of R 2 is, the better the performance of the estimation is. 

In Table [I] we summarized the goodness-of-fit of f a , a G {x, z, xz} for 200 hundred runs. 
The scenarios of three settings were used in four procedures: I) REML with p estimated, II) 
REML with p fixed at 2, III) p-REML with p estimated, and IV) p-REML with p fixed at 
2. It can be seen that the performance of using procedure I) is not so good; p goes to an 
extremely large value and f a 's deviate from / Q 's. This may be because the REML likelihood 
function dose not have a maximum and the likelihood increases or becomes flat with p. In 
such a case, the entries of K z becomes a matrix of ones. One solution when the REML 
function becomes flat with p is to fix p at the turning point of the REML function. In 
procedure II) we fixed p at 2. The average of \\z — z'\\ 2 on all pairwise observations is very 
close to 2 and using this p allows us to avoid having extreme values for the entries of K z . The 
performance of this procedure is improved significantly; all the R 2 values are over 90% and 
close to 1, and the intercepts and slopes of the regressions are close to and 1. However, a 2 



values are all close to zero. The zero error component happens in REML estimation (Searle 
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et al. 1992), especially with high dimensional parameter spaces. 



Table [T] shows that the performance is much better for the two p-REML procedures. Not 
only is the fitting of nonparametric functions very good, but the estimate of error variance 
component a 2 is close to the true value. As expected, fitting with extra genes introduces more 
error, which results in the increase of a 2 . This is because fitting irrelevant genes is equivalent 
to introducing more noise into the model. However, the results show little difference in fitting 
/ a 's for differently used gene numbers. Increasing the number of observations is expected 
to improve the fitting performance. Although overall there is no much difference between 
n = 100 and 150, there is slight improvement in fitting the P-E interaction effect. This can 
be seen from the fact that R 2 increases and the slope of regressing f xz on f xz is closer to 1 
for n = 150. 

The overall goodness-of-fit using p-RMEL is very good, except there are small biases: 
the regression slope of f z on f z is slightly smaller than 1 and the the regression slope of f xz 
on f xz is slightly larger than one. This means that f z is overestimated and f xz is slightly 
underestimated. However, for each f z and f xz , the fitted results can explain most of the 
variations as all the R 2 values are very close to 1. We also realized that the fitting of f z + f xz 
is much better than individual ones (the regression parameters of f z + f xz on f z + f xz are 
not shown), which is easy to be understood if we can treat r = r z + r xz as one random effect 
with covariance r z K z + t xz K xz . This indicates that there is no bias in fitting f z + f xz , but the 
weight between f z and f xz might be biased. The reason for this can be understood from the 
interaction kernel expression ([6]). It can be seen that if the entries of matrix xx' + k x (x,x') 
are close to each other, then t z K z + r xz K xz is nothing more than a scalar times K z , and we 
will have overestimation of f z . However, this bias is not too significant, because the good fit 
of f z + f xz and the high R 2 values of fitting f xz indicate that it has little influence on testing 
either the overall pathway or the P-E interaction effect. 
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5.2 Test Study 

To obtain better convergence, for the rest of this paper we adopt the Marquardt procedure 
as a scoring method. With the Marquardt method we have flexible iteration steps, this is 

1 dl R 



0(fc+i) = 0(fc) + z{0 {k) ) + S (k) I 



dO 



where Ir, I, and 6 are replaced by the counterparts of the p-REML procedure when it is 
required. The scalar 5^ partially determines the step size and I is the identity matrix. If 
^(fc) j g sm all, the procedure approximates a scoring method. If 5^ is large, a small step is 
taken in approximately the direction of the scoring method. We modify 5^ accordingly to 
achieve increasing likelihood. In this paper, we start from 5^ = (1.0 x 10~ 5 ) x n J^be?of e's 
to make the initial step size as large as possible. 

We first studied the performance of RLRT of testing two zero variance components under 



hypothesis (24). In this simulation study we are particularly interested in two issues: how 
RLRT performs at different fixed p values since we prefer to estimate the parameters with 
p fixed and how the performance degrades with irrelevant genes included in the model. The 



true model used and the data generating method are the same as described for (34) in Section 



5.1 For both issues, we first set a = and vary b, and then set 6 = and vary a. It turns 



out the test is very powerful when both a and b are not equal to zero, so no simulation on 
this situation is shown here. For all cases, the total number of simulation runs is 1000 times. 



In addition, the function f x (-) in (34) has a trivial nonlinear component, so we can apply 
RLRT in this simulation. 

For the first issue, we consider the case where the sample size is n = 100, and both the 
true and used gene numbers are p = 30. Table [2] presents the Type II errors and powers of 



testing hypothesis (24) for 2 sets of {a, b} values at 4 different p values (one is estimated). 



In general, the power curve of RLRT does not depend on p very much. Liu et al. (2007) 
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revealed the same phenomena for the score test of a single variance component within a 
model with only one random effect. This is because moderate differences of p do not change 
the structure of the covariance matrix very much, except for extreme values such as p — > 
or p — > oo, with which the covariance matrix turns to an identity matrix or a matrix of 
ones. Note that the empirical Type II errors of all situations are around 0.03, smaller than 



the nominal one. The reason could be the approximation of (28) due to the assumption, 
Q x = A- 1 « 0. 

To test two zero variance components with extra genes, we consider simulations with the 
sample sizes n = 60 and n = 35. The latter mimics the Type II diabetes data where the 
total subjects under study are n = 35. Fitting with the equal true and used gene numbers 
is compared to fitting with an extra 20 irrelevant genes. The results in Table [3] show that, 
when fitting with extra genes, the power decreases as expected but not dramatically, which 
means that the model we proposed can be applied to pathway data for which only some of 
the genes are related to the responses. In addition, comparing Table [2] and [3] shows that the 
power does decrease with the sample size n. 

The simulation study for testing P-E interaction using RLRT and the score test is carried 
out using a new setup for the data generation. We continue using the same nonparametric 
expression ( p4| except with true gene number p = 5; that is, simply replacing f z (-) and f xz (-) 
as f z ( z J) = a ■ z 4 (5) exp (-0.2|z|f } ) /5 and f xz (xi,zj) = b ■ e x ^ 10 sin (zf } ) cos (z f (5) ) /8, 

where zf ) = Y%=i z ij> 1*1^ = Z^=i \ Z H I/ 5 and = Y?j=i z ij/$- x i, z ij an d e i are generated 
the same way as before. Note the function form changes when the gene number is different 



in (34). We use this setup to compare two test procedures for testing (29). For the score 
test, we first estimate the parameters using p-REML and then calculate the statistics using 



expressions (17) and (33). The results are listed in Table |4j Again, we see that the test's 
power does not depend on p. The results indicate that the RLRT are slightly lower in power 
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and that the type I errors of the two test methods are all closer to the nominal 5% from 
different directions. These results indicate we can apply both test methods under suitable 
conditions. 



6 Application to Type II Diabetes Data 



We applied our mixed model (10) to a set of diabetes data from Mootha et al. (2003). 
They utilized the HGC-133a Affymetrix genechip with 22,283 genes to study 17 normal 
glucose tolerance individuals vs. 18 Type II diabetes mellitus patients. The 22,283 genes 
make up a total of 251 pathways. The goal of this study is to identify pathways with the 
highest significant overall pathway effect when an environmental variable, body mass index, 
is present in the model, and from them identify pathways with significant P-E interaction 
effect. Therefore, there are a total of 251 sets of data, each having n = 35 observations. 
Corresponding to each individual pathway, the data set contains (y, X, Z), where y is the 
outcomes of glucose level, X has the same meaning as before with the first column of l's 
and the second column as the body mass index data of 35 subjects, and Z(n x p) is the gene 
expression levels of each pathway, which contains the number of genes ranging from p = 3 
to p = 543. 

The fitting results of the top 20 pathways are listed in Table [5] ranked ascendingly in the 
p- value of testing the overall pathway effect using RLRT D. It has almost an identical order 
of the magnitude as the D. It can be seen that 19 out of the 251 pathways are significant. 
For each pathway, the variance components are estimated using p-REML methods and the 



standard error of those parameters including a are calculated using information matrix (17) 
with the p-REML estimates plugged in. Again, the initial values for the variance parameters 
are (cr 2 , A" 1 , A" 1 , X~^) T = (0.001, 1, 1, 1) T and p is fixed at the average of ||z — z'|| 2 of different 
pairwise observations, which ranges from 0.1 to 1.8 for different pathways. 
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To show an overall view of the fitting results for 251 pathways, Figure [3] plots the four 
estimated variance components in the same order of the p- value of RLRT D. The straight 
dashed line divides the significant and insignificant pathways of RLRT. The error compo- 
nents, o" 2 's, are around the constant 3.0 except for those top significant pathways. This is 
consistent with the test results indicating that for those pathways with genes relevant to the 
responses, the error is reduced since part of the variation of the responses is explained by 
pathway main effect or P-E interaction effect. The variations of f x and f z seems to compen- 
sate for each other. For the top 50 pathways, f^'s are close to zero and f z values are large. 
On the other side, for those pathways which are ranked as lower than 50, f z values are very 
small and f x values increase. This indicates that for those pathways not relevant enough to 
the response, part of the variation of response is explained by the nonlinear relationship of 
the responses and the environmental variable. The variation of f xz seems less dramatic than 
other random effects. It does not decrease to zero for those non significant pathways, and 
stabilizes after the top 100 pathways. However, using the test of RLRT d, we show that the 
lower ranked pathways, ranked as [50, 251], are not significant in the interaction effect. 
These results suggest that the body mass index is important in explaining the relationship 
between the glucose level and the genetic pathway since many pathways that are significant 
in the overall pathway effect are either significant in the interaction effect or not. 

Because the distribution for D is asymptotic, the p-value calculated based on 35 observa- 
tions may not be as accuracte as expected. Hence, we carried out a permutation test process 
to obtain the exact distribution of D as follows: 



Step 1: We fit the observed data with the full model (10) and reduced model under 



hypothesis (24) using the p-REML approach. In both models, we set t x = since we 



assume that r x is insignificant when deriving (28). Then we obtained test statistics D 



and calculated the residual e = r z + ?xz + c using the fitted results of the full model 
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from y = X(3 + r z + r xz + e. 



Step 2: We permuted the residual e to get new e* and simulate outcomes as y* 



Step 3: Based on y*, X and Z, we fit the full model (10) and reduced model under hy- 



pothesis (23) again using the p-REML approach and then calculated the test statistics 
D*. 

• Step 4- We repeated Steps 2-3 for a large number of times (e.g. 10,000 times). 

• Step 5: We obtained the empirical p- value of the RLRT by formula p- value = (number 
of D*'s greater than D) (total number of _D*'s). 

The p- value of the permutation test of D as well as the RLRT D are listed in Table [6] in 
the same order of Table [5] for the top 20 pathways. Note that for RLRT if the sample size 



is too small such that the information matrix (22) is non positive definite, </> in (28) cannot 



be calculated, so we are not able to get the asymptotic distribution of D. However the 



information matrices of the 251 pathways under hypothesis (24) are all positive definite (not 



true under hypothesis (29)), so we are able to test the overall pathway effect for all using 
RLRT D. The results of both tests are similar to each other with respect to the general 
rank of the significance, specifically both tests have the same top 3 pathways, which are 
pathways 73, 274, and 230. In addition, most of the p-values of the permutation tests are 
slightly larger than those of RLRT, as expected, since the permutation test is usually more 
conservative. Table |6]also labels those significant pathways ranked in the top 50 list according 



to the global score test (Goeman et al. , 2004) and the forest tree method Pang et al. (2006); 



Pang and Zhao (2008), which do not take into account the environmental variable in their 



models. Our approach identified pathways that have either significant main pathway effect, 
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the interaction effect, or both, while other methods determined many as having a significant 
main pathway effect only. Through following one zero variance component test, we also 
discovered that some pathways have a significant P-E interaction effect although they may 
not have a significant main pathway effect. 

Furthermore, the p- values of RLRT d are also listed in Table [6j There are pathways for 
which we are unable to calculate d because the information matrix is not positive definite. 
In Figure [4] the p- values of RLRT D and RLRT d of all pathways are plotted for comparison. 
Among the top 50 that are significant in overall pathway effect, only part of them are 
significant in the interaction effect, but for the remaining 151 pathways, none are significant 
in either interaction effect or overall pathway effect. Similar to RLRT D, a permutation test 
process for the exact distribution of RLRT d is introduced here: 



Step 1: We fit the observed data with the full model (10) and reduced model under 



hypothesis (29) using the p-REML approach. Again in both models we assume that 
t x is negligible. Then we obtained d, and calculated the residual e = r xz + e using the 
fitted results of the full model from y = X/3 + r z + r xz + e. 

Step 2: We permuted the residual eo to get new e* Q and simulated outcomes as y* = 
X(3 + r z + e*. 

Step 3: Based on y*, X and Z, we fit the full model and reduced model under hy- 



pothesis (29) again using the p-REML approach and then calculated the test statistics 
d*. 

Step 4- We repeated Steps 2-3 a large number of times (e.g. 10,000 times). 

Step 5: We obtained the empirical p- value of the RLRT by formula p- value = (number 
of rf*'s greater than d) -j- (total number of d*'s). 
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The permutation test results of RLRT d are close to those of RLRT d in the 20 pathways, 
but it is difficult to tell which one is more conservative. 



We also calculated the p- values of testing H (29) using the score test approach for the 
top 20 pathways. Compared with the RLRT d and RLRT d permutation tests, the p-values 
of the score test is similar in sense of determining the significant pathways at the 5% level. 
Among these top 20 pathways with significant overall pathway effect, the pathways with 
insignificant interaction effect are {229, 152, 16, 236, 144, 151, 103, 271, 101, 158} according to 
the score test, and {229, 152, 16, 236, 144, 151, 14, 103, 271, 150, 158} according to the RLRT 
d permutation test. Note that the difference of the two sets, {14, 101, 150}, all have marginal 
p-values for the two tests at the 5% level. If they are removed from the two sets, both tests 
have identical pathways which have insignificant P-E environment interaction effects. 

Based on the three tests procedures, we identified the pathways with a significant 
P-E environment interaction effect for all tests among the top 20 pathways. They are 
{73, 274, 230, 173, 228, 172} pathways at the 5% level. These pathways are known to be 
related to Type II diabetes. Pathway 73 is a Cysteine metabolism pathway. It is known that 
taurine (a semi-essential sulphur amino acid) derived from cysteine metabolism can prevent 



diabetes mellitus and/or insulin resistance (Franconi et al. , 2006). Pathway 274 is involved in 
the Urea cycle and metabolism of amino groups, which has also been reported to be related 



to Type II diabetes flCzyzyk et al.[ [1989]). Pathway 230 is OXPHOS.HG-UlSSA.probes 
pathway. It has been reported that genes involved in oxidative phosphorylation are co- 
ordinately upregulated with fasting hyperglycaemia in the livers of patients with Type II 



diabetes ( Misu et al. , 2007 ) . The transcription levels of a class of genes involved in oxida- 



tive phosphorylation mechanisms are consistently lower in diabetics than in controls (Mootha 



et al. 


2003 


Misu et al. 


2007 



pathway. It is known that Type II diabetes mellitus also induces an increased urinary 
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excretion of total glycosaminoglycans (Juretic et al. 2002). Pathway 228 is involved in 



Oxidative phosphorylation. It is known to be related to diabetes (Misu et al. , 2007; [Mootha 



et al. 2003, 2004). This pathway is a process of cellular respiration in humans (or in gen- 
eral eukaryotes) and contains coregulated genes across different tissues and is related to 
insulin/glucose disposal. It is associated with ATP synthesis, a pathway involved in energy 
transfer. Pathway 172 is MAP00530-Aminosugars-metabolism pathway. Aminosugars (= 
glucosamine) have no effect on fasting blood glucose levels, glucose metabolism, or insulin 
sensitivity at any oral dose level in healthy subjects, individuals with diabetes, or those with 



impaired glucose tolerance (Simon et al. 2011). 



7 Discussion 

The development of a pathway-based mixed model to relate the response with genetic path- 
ways is motivated by the fact that genes always interact with the environmental variables. 
Modeling the P-E interaction effect can help in further understanding the biological mecha- 
nisms underlying diseases and facilitate the discovery of potential biomarkers. However, no 
existing approaches are able to jointly analyze pathways with the environmental variables 
when P-E interaction exists. 

In this paper, we have addressed a mixed effects model connecting with kernel machine 
methods and smoothing spline, so that we can analyze the genetic pathway data with a 
continuous clinical outcome when the P-E interaction effect is present in the model. We 
demonstrated the application of our method to a pathway data of Type II diabetes. Our 
approach allows us to evaluate the pathway effect and its interaction with the environmental 
variables by estimating the corresponding variance components and testing the significance 
of those parameters. Because of the high dimensional parameters space, there are usually 
some difficulties in solving the REML equations, such as non-positive error estimated. We 
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reduced the parameter space dimension in solving REML equations by introducing the p- 
REML approach to estimate the variance components so that the error component is always 
in the parameter space. The p-REML approach not only allows us to solve the REML 
equations efficiently, but also provides an efficient choice in testing one or two zero variance 
components besides the global score test, i.e. the profile restricted likelihood ratio test for 
testing the overall pathway effect or P-E interaction. 

Modeling the linear mixed model with a kernel machine has other advantages. It allows 
us to choose appropriate kernels to construct the variance matrix of the random effect as well 
as the interaction random effect in accordance with the data structure. In this paper, we 
focused on the Gaussian kernel, but when the sample size is large so that the computation 
becomes expensive, some less computational intensive alternatives to Gaussian kernel are 
available, such as rational quadratic kernel: k(z T ,z' T ) = 1 — \\z — z'\\ 2 /(\\z — z'\\ 2 + c). 
Other kernels, such as a polynomial kernel, an exponential kernel, an inverse multiquadric 
kernel, etc., have also been examined and can replace the Gaussian kernel in appropriate 
situations. Note that these kernels are similar to the Gaussian kernel in terms of reducing the 
dimension of the covariates through measuring the similarity of z and z' . To some extent, 
this may be a disadvantage of the kernel method since there may be some information lost 
beyond the similarity of the two attributes. 

Possible extensions of our method include applying the interaction kernel machine to 
generalized linear models. Logistic kernel machine regression with a Gaussian kernel has 



been developed by Liu et al (2008 ), but no interaction between the genetic pathway effect and 
environmental variable has been considered. By adding the interaction kernel machine to a 
generalized linear model, our method can be applied in more general genomewide association 
studies, especially in the case-control studies of G/P-E interaction. The second potential 
extension of our method is to consider a higher dimension of environmental variables xf, 
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such as bivariate xj = (xa, a^), longitude and latitude data, and the nonparametric function 
f x (xj) can be fitted using thin plate splines (Gu and Wahba, 1993). With the kernel of the 



thin plate splines, we can construct the interaction function space kernel similarly. This 
extension may have wider applications such as in spatial data where the interaction between 
location and other high dimensional covariates are particularly interesting. 

We note that we evaluate the interaction between each pathway and environmental vari- 
able. It is known that pathways are not independent of each other because of shared genes 
and interactions among pathways as well as their interaction with environmental variables, 
making it difficult to adjust the p-value due to the complex dependency structure. Be- 



cause existing multiple comparison methods based on false discovery rates (Benjamini and 



Hochberg 


1995 


Storey 


2002 



take into account the interaction between genes and environmental variables, they are not 
applicable in such a complicated situation as our problem. Developing a multiple comparison 
method will be an interesting and challenging problem because of the complex dependence 
structure among pathways and environmental variables. 
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Table 1: Assessments of estimating f x ,f z and f xz simulated by (34) using REML and p- 
REML procedures with p estimated from initial value 2 or fixed at 2. Total runs number 
200 for each scenario, and the average values are reported. 







fitted p 




P 




fx ^ fx 












fxz ~ fxz 






n 


(true p) 


a 2 


(initial p) 


Int 


Slope 


Ft 2 


int 


Slope 


Ft 2 


Int 


Slope 


Ft 2 






30(30) 


0.34 


2130(2) 


-0.38 


1.00 


0.97 


-0.01 


10.51 


0.90 


-0.14 


5.19 


0.46 




100 


40(30) 


0.29 


1824(2) 


-0.55 


1.06 


0.96 


0.01 


11.65 


0.89 


-0.11 


4.17 


0.50 


REML 




50(30) 


0.32 


1929(2) 


-1.53 


1.26 


0.96 


-0.02 


16.07 


0.87 


-0.13 


5.28 


0.48 


P 




30(30) 


0.26 


1604(2) 


-1.15 


1.17 


0.98 


0.09 


5.87 


0.93 


-0.17 


3.70 


0.54 


estimated 


150 


40(30) 


0.29 


1814(2) 


-0.68 


1.18 


0.97 


-0.09 


8.65 


0.91 


-0.15 


3.95 


0.48 






50(30) 


0.35 


2054(2) 


-1.24 


1.18 


0.97 


0.06 


12.32 


0.90 


-0.15 


4.79 


0.45 






30(30) 


6.9e-10 


2 


0.10 


0.99 


0.99 


0.01 


0.85 


0.99 


0.01 


1.44 


0.90 




100 


40(30) 


8.6e-10 


2 


0.13 


0.98 


0.98 


0.02 


0.86 


0.98 


0.00 


1.40 


0.90 


REML 




50(30) 


8.5c-10 


2 


0.16 


0.98 


0.96 


0.01 


0.86 


0.98 


0.01 


1.41 


0.88 


P 




30(30) 


8.5c-10 


2 


0.05 


0.99 


0.99 


0.01 


0.84 


0.99 


0.01 


1.41 


0.93 


fixed 


150 


40(30) 


8.7c-10 


2 


-0.00 


1.00 


0.99 


0.00 


0.84 


0.99 


-0.01 


1.40 


0.92 






50(30) 


7.1e-10 


2 


0.10 


0.99 


0.99 


0.02 


0.85 


0.99 


0.00 


1.38 


0.91 






30(30) 


0.04 


3.96(2) 


-0.24 


1.04 


1.00 


0.01 


0.85 


0.99 


-0.04 


1.38 


0.90 




100 


40(30) 


0.07 


3.36(2) 


-0.19 


1.03 


1.00 


-0.01 


0.87 


0.99 


-0.05 


1.46 


0.89 


p-REML 




50(30) 


0.09 


4.72(2) 


-0.31 


1.04 


1.00 


0.06 


0.90 


0.98 


-0.04 


1.44 


0.88 


P 




30(30) 


0.02 


3.00(2) 


-0.28 


1.04 


1.00 


0.01 


0.85 


0.99 


-0.05 


1.29 


0.92 


estimated 


150 


40(30) 


0.02 


3.63(2) 


-0.29 


1.04 


1.00 


0.01 


0.86 


0.99 


-0.04 


1.29 


0.91 






50(30) 


0.04 


3.19(2) 


-0.13 


1.02 


1.00 


0.01 


0.85 


0.99 


-0.02 


1.37 


0.91 






30(30) 


0.04 


2 


-0.08 


1.01 


1.00 


0.02 


0.85 


0.99 


-0.01 


1.64 


0.91 




100 


40(30) 


0.11 


2 


-0.17 


1.03 


0.99 


-0.00 


0.88 


0.98 


-0.03 


1.52 


0.91 


p-REML 




50(30) 


0.11 


2 


-0.12 


1.02 


0.99 


-0.00 


0.90 


0.98 


-0.01 


1.38 


0.91 


P 




30(30) 


0.02 


2 


-0.08 


1.01 


1.00 


0.02 


0.86 


0.99 


-0.01 


1.34 


0.93 


fixed 


150 


40(30) 


0.03 


2 


-0.11 


1.02 


1.00 


-0.01 


0.85 


0.99 


-0.05 


1.37 


0.92 






50(30) 


0.04 


2 


-0.09 


1.01 


1.00 


0.02 


0.86 


0.99 


-0.02 


1.44 


0.92 



Table 2: Simulation study for RLRT of overall pathway effect with p fixed at different values 
and estimated. Simulated samples size n = 100, and both used and true gene number equal 

top = 30. 

p 6 = 0.2 0.35 0.5 1 

2 0.03 0.34 0.91 1.00 1.00 

5 0.02 0.34 0.89 0.99 1.00 

a = 

10 0.02 0.30 0.88 0.99 1.00 

estimated 0.03 0.33 0.87 0.99 1.00 
a = 0.05 0.1 0.2 0.5 
2 0.03 0.07 0.37 0.96 1.00 

5 0.02 0.07 0.37 0.95 1.00 

6 = 

10 0.02 0.06 0.34 0.91 1.00 

estimated 0.03 0.06 0.34 0.93 1.00 



Table 3: Simulation study for RLRT of overall pathway effect with fitted genes number p 
equal or larger than true one p = 30. Simulated samples size n = 60 and n = 35. The 
parameter p is fixed at 2. 

n usedp 6 = 0.2 0.35 0.5 f 



30 0.03 0.18 0.57 0.88 1.00 

50 0.03 0.15 0.48 0.76 0.99 

30 0.04 0.10 0.27 0.46 0.85 

50 0.03 0.08 0.23 0.38 0.78 



a = 



6 = 



60 



35 



a = 0.1 0.2 0.5 1.5 
30 0.03 0.15 0.51 0.72 0.72 
50 0.03 0.13 0.41 0.72 0.76 



60 



35 



30 0.04 0.09 0.25 0.56 0.63 
50 0.03 0.05 0.18 0.43 0.55 



Table 4: Simulation study for PLRT and score test of P-E interaction with p fixed at different 
values. Fitted and used gene numbers are equal to p — 5, and n = 1 00. 

p b = 0.1 0.2 0.35 0.5 0.8 f 
2 0.04 0.24 0.58 0.95 f.00 f.00 f.00 
RLRT 5 0.04 0.24 0.64 0.98 f.00 f.00 f.00 
fO 0.03 0.24 0.67 0.97 f.00 f.00 f.00 
score 2 0.08 0.31 0.68 0.98 1.00 1.00 1.00 
test 5 0.06 0.30 0.72 0.97 1.00 1.00 1.00 
10 0.06 0.26 0.72 0.98 1.00 1.00 1.00 



Table 5: Estimated parameters of top 20 pathways obtained from p-REML and ranked by 



p- values of testing RLRT D. The numbers in the round brackets are the standard errors. 



pathway 
ID 


gcnc# 


$0 






a 2 




Tz 




fixed 
P 


RLRT 
D 


RLRT 
p- value 


73 


11 


5.09(1.51) 


-0.01(0.21) 





08 


0.39) 


1.0c-ll(0.02) 


6.09(3.12) 


17.7(11.8) 


0.457 


12.2 


0.001 


274 


16 
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49 
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4.83 


0.030 


150 


21 


5.98(0.94) 


0.19(0.14) 


2 


54 


1.12) 


7.5c-ll(0.02) 


0.97(2.10) 


5.75(7.65) 


1.161 


4.66 


0.033 
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2 


75 


0.99) 
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3 
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Table 6: P-values of different tests for top 20 pathway significant in the overall pathway- 
effect. Columns 2 and 3 are labels indicating appearance in the top 50 list of other methods 
or not. Missing values in column 6 is because the information matrix is not positive definite. 
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Forest 


RLRT 


permutation 


RLRT 


permutation 


score test 


ID 


Score Test 


Tree 


test for D 


test for D 


test for d 


test for d 


for U T „ 


73 


Yes 


Yes 


0.001 


0.001 


0.002 


0.001 


0.005 


274 


Yes 


No 


0.006 


0.011 


0.025 


0.013 


0.016 


230 


Yes 


Yes 


0.006 


0.010 


- 


0.025 


0.007 


229 


Yes 


Yes 


0.012 


0.020 


- 


0.138 


0.062 


152 


No 


No 


0.014 


0.015 


0.179 


0.303 


0.163 


16 


Yes 


Yes 


0.017 


0.027 


0.126 


0.147 


0.058 


173 


Yes 


Yes 


0.017 


0.020 


0.017 


0.018 


0.002 


236 


No 


No 


0.019 


0.021 


0.133 


0.119 


0.104 


144 


Yes 


Yes 


0.019 


0.020 


0.076 


0.072 


0.106 


151 


No 


No 


0.019 


0.023 


0.205 


0.262 


0.146 


14 


Yes 


No 


0.024 


0.031 


0.113 


0.054 


0.046 


228 


Yes 


Yes 


0.028 


0.035 


0.032 


0.024 


0.006 


103 


No 


Yes 


0.030 


0.039 


0.121 


0.106 


0.086 


271 


No 


No 


0.030 


0.037 


0.148 


0.142 


0.110 


150 


No 


No 


0.033 


0.034 


0.080 


0.062 


0.044 


172 


No 


No 


0.039 


0.044 


0.016 


0.015 


0.009 


133 


No 


No 


0.044 


0.057 


0.053 


0.043 


0.018 


8 


Yes 


Yes 


0.045 


0.052 


0.051 


0.038 


0.032 


101 


No 


No 


0.045 


0.044 


0.068 


0.049 


0.056 


158 


Yes 


No 


0.056 


0.054 




0.343 


0.560 




Figure 1: Diagram of the parameter space of RLRT for testing two zero variance components 
(a), and testing the P-E interaction effect (b). 
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jure 2: Selected example of fitting results of setting 1. Because of the high dimensionality, 
r xz and f are plotted vs. the observation index only. 
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Figure 3: The estimated variance components of a 2 , f x , f z , f xz for 251 pathways ordered by 
p- values of testing the overall pathway effect. The dash lines separate the significant and 
insignificant pathways at 5% level. 




Index 

Figure 4: The p-values of testing overall pathway effect (RLRT D) and P-E interaction effect 
(RLRT d) for 251 pathways. The vertical dash line divides the significant and insignificant 
pathways of overall pathway effect test, and the horizontal dash line indicates 5% significant 
level. Some p-values of RLRT d are missing because the information matrix is not positive 
definite. 



Appendix A The Representation of the Natural Cubic 

Spline 



Following Green and Silverman (1994), the representation of the natural cubic spline (ph in 



section 2J2 is called the value-second derivative representation. Details for defining matrices 
B and M are shown as the following. 

Suppose f x is the natural cubic spline with n distinct x\ <,...,< Define 

fx,i = fx(x i) and 7i = for i = 1, n 

By the definition of natural cubic spline, 71 = 7 n = 0. Let f x stands for the vector 
(/e,1j •••) fx,n) T an d let 7 = (72, 7„_i) T where 7 is a (n — 2) x 1 vector with the ele- 
ment index starting at i — 2. Now define two matrices, Q and R. Let ^ = t i+1 — £j for 
i = 1, n — 1. Let Q be the n x (n — 2) matrix with entries g^-, for i — 1, ...,n — 1 and 
j = 2, ...,n- 1, given by 

Qj-ij = h J-n Qjj = ~ h j-i ~ h J^ and Qj+ij = h ] 1 ( 35 ) 

for j — 2, n — 1 and g^- = for \i — j\ > 2. The columns of Q are indexed in the same 
way as the elements of 7 starting at j = 2, so that the first element of Q is qw 

R is a (n — 2) x (n — 2) symmetric matrix with elements r^, for % and j running from 2 
to n — 1, given by 

r a = n{hi-i + fri) for i = 2, n - 1, 

3 ! (36) 

r M+ i = r i+ i < = for z = 2, n - 2, 


and Tij = for \i — j\ > 2. 

The matrix R is strictly diagonal dominant and strictly positive definite. Using the 



Cholesky factorization that avoids taking the square roots (Green and Silverman, 1994) 



Section 2.6.1, we can factorize R as 

R = UAU 1 



where A is a diagonal matrix and U is a lower triangular band matrix with diagonal elements 
all equal to 1. Since R are strictly positive definite, all diagonal elements of A are positive, 
R^ 1 = (A 1 / 2 f/ T ) _1 (f/A 1 / 2 ) -1 . The penalty matrix M can be expressed as 

M = QR^Q 7 = Q{X 1 ' 2 U T )-\UA 1 ' 2 Y l Q T = LL T , (37) 

where L = Q^A 1 ^ 2 ^)^ 1 . The B matrix thus is calculated by 

b = l{l t l)- 1 = g(A 1 / 2 t/ T )- 1 {[{a 1 i 2 u t )- x \ t q t q{a 1 ' 2 u t )- 1 \ 
= g(A 1 / 2 f/ T )- 1 (A 1 / 2 f/ T )(g T g)- 1 (A 1 / 2 f/ T ) T 

= Q(Q T Q)- 1 UA 1 / 2 . 



-l 



The Theorem 2.1 in Green and Silverman (1994) states that the vectors f x and 7 specific a 



natural cubic spline f x if and only if the condition g T fa; = Ry is satisfied. If this condition 
is satisfied then the roughness penalty will satisfy 



n-l 



Jo j=1 n j 

= j t Rj = ^QR~ 1 Q T f x = f x Mf x . 

In the above derivation we assumed that x®,i = l,...,n, were distinct and ordered, so the 
rank of the penalty matrix M is n — 2 and B is a n x (n — 2) matrix. In our model, we 
shall have r distinct and ordered x®, i — l,...,r, from the observed data where 
r < n and x^s may not be ordered. Based the r x°'s, B is a r x (r — 2) matrix. Thus we 



will use a n x r incidence matrix defined in a way similar to that given by Green and 



Silverman (1994), Section 4.3.1, such that B = NB, where the left B is what we shall use 
in the model, and the right B is calculated based on r distinct x°'s. 

53 



